Understand the fundamental concepts of Bayesian statistical modelling.
Be able to understand articles describing Bayesian analyses.
Bonus: realise that the Bayesian approach is more intuitive than the frequentist approach.
Practical objectives:
Be able to carry out a complete analysis (i.e., identifying the appropriate model, writing the mathematical model, implementing it in R, interpreting and reporting the results) of a simple dataset.
Planning
Course n°01: Introduction to Bayesian inference, Beta-Binomial model
Course n°02: Introduction to brms, linear regression
Course n°03: Markov Chain Monte Carlo, generalised linear model
Course n°04: Multilevel models, cognitive models
\[\newcommand\given[1][]{\:#1\vert\:}\]
A matter of interpretation
What is the probability…
Of obtaining an odd number with a fair die?
Of you learning something cool during this course?
Do these two questions refer to the same “sort” of probability?
Classical (or theoretical) interpretation
\[
\Pr(\text{odd}) = \frac{\text{number of favourable outcomes}}{\text{total number of possible outcomes}} = \frac{3}{6} = \frac{1}{2}
\]
Problem: this definition only applies to situations in which there is a finite number of equiprobable potential outcomes…
Limitation: what is the probability that it will rain tomorrow?
Frequentist (or empirical) interpretation
The relative frequency of an event \(x\) is defined as \(n_{x} / n_{t}\), where \(n_{x}\) is the number of occurrences of the event \(x\) and \(n_{t}\) the total number of observations. The frequentist interpretation postulates that, in the long run (i.e., as the number of observations approaches infinity), the relative frequency of an event converges to what we call its “probability”:

\[\Pr(x) = \lim_{n_{t} \to \infty} \frac{n_{x}}{n_{t}}\]

Consequence: the concept of probability only applies to collectives, not to single events.
library(tidyverse)

sample(x = c(0, 1), size = 500, prob = c(0.5, 0.5), replace = TRUE) %>%
  data.frame() %>%
  mutate(
    x = seq_along(.),
    y = cummean(.)
  ) %>%
  ggplot(aes(x = x, y = y) ) +
  geom_line(lwd = 1) +
  geom_hline(yintercept = 0.5, lty = 3) +
  labs(x = "Trial number", y = "Proportion of heads") +
  ylim(0, 1)
Limitations of the frequentist interpretation
Which reference class? What is the probability that I will live until 80 years old? As a man? As a French person?
What about non-repeatable events? What is the probability of you learning something cool during this course?
The resolution issue: How many observations do we need to get a good approximation of the underlying probability? A finite class of events of size \(n\) can only produce relative frequencies with precision \(1/n\)…
Propensity interpretation
The frequentist (i.e., long-run) properties of objects (e.g., a coin) are caused by the intrinsic physical properties of those objects. For example, a biased coin will generate a biased relative frequency (and therefore probability) because of its physical properties. For propensity theorists, probabilities represent these intrinsic characteristics, these propensities to generate certain relative frequencies, and not the relative frequencies themselves.
Consequence: these properties are the properties of individual events… and not of sequences! The propensity interpretation therefore allows us to talk about the probability of single events.
Logical interpretation
There are 10 students in this room
9 wear a green t-shirt
1 wears a red t-shirt
One person is drawn at random…
Conclusion #1: the student drawn wears a t-shirt ✔
Conclusion #2: the student drawn wears a green t-shirt ✗
Conclusion #3: the student selected at random wears a red t-shirt ✗
The logical interpretation of the concept of probability attempts to generalise logic (true/false) to the probabilistic world. Probability therefore represents the degree of logical support that a conclusion can have, relative to a set of premises (Carnap, 1971; Keynes, 1921).
Consequence: all probability is conditional.
Bayesian interpretation
Probability is a measure of the degree of uncertainty. An event that is certain will therefore have a probability of 1 and an event that is impossible will have a probability of 0.
So to assign equal probabilities to two events is not in any way an assertion that they must occur equally often in any random experiment […], it is only a formal way of saying I don’t know (Jaynes, 1986).
To talk about probabilities in this context, we no longer need to refer to the limit of occurrence of an event (frequency).
A probability is a numerical value assigned to an event \(A\), understood as a possibility belonging to the universe \(\Omega\) (the set of all possible outcomes).
Probabilities have to conform to the following axioms (Kolmogorov, 1933):
Non-negativity: \(\Pr(A) \geq 0\) for every event \(A\).
Normalisation: \(\Pr(\Omega) = 1\).
Additivity: if \(A_{1}\) and \(A_{2}\) are mutually exclusive, then \(\Pr(A_{1} \cup A_{2}) = \Pr(A_{1}) + \Pr(A_{2})\).
The last axiom is also known as the sum rule and can be generalised to non-mutually exclusive events: \(\Pr(A_{1} \cup A_{2}) = \Pr(A_{1}) + \Pr(A_{2}) - \Pr(A_{1} \cap A_{2})\).
Sum rule and product rule
Sum rule (for two mutually exclusive events): \(\Pr(A_{1} \cup A_{2}) = \Pr(A_{1}) + \Pr(A_{2})\).
Think about the probability of getting an odd number with a fair die. We may also write it as \(\Pr(x = 1) + \Pr(x = 3) + \Pr(x = 5) = \frac{3}{6}\).
Product rule (for two independent events): \(\Pr(A_{1}, A_{2}) = \Pr(A_{1}) \times \Pr(A_{2})\). Think about the probability of getting two 6s in a row with a fair die. We may write it as \(\Pr(x = 6, y = 6) = \frac{1}{6} \times \frac{1}{6} = \frac{1}{36}\).
If you understand and remember these two rules, you already know Bayesian statistics!
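Both rules can be checked by brute-force enumeration in R (a small sketch; the variable names are ours):

```r
# enumerate the six equiprobable outcomes of a fair die
die <- 1:6

# sum rule: Pr(odd) = Pr(1) + Pr(3) + Pr(5)
p_odd <- sum(die %% 2 == 1) / length(die)
p_odd # 0.5

# product rule (independent throws): Pr(6, then 6) = 1/6 * 1/6
p_two_sixes <- (1 / 6) * (1 / 6)
p_two_sixes # 1/36
```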
Joint probability
Probability that die \(x\) is equal to 2 and die \(y\) is equal to 3 is: \(\Pr(x = 2, y = 3) = \Pr(y = 3, x = 2) = \frac{1}{36}\).
From the sum rule to marginalisation
With more than one variable, the sum rule tells us how to ignore one. For instance, the probability that the first die shows 1 is: \(\Pr(x = 1) = \Pr(x = 1, y \in \{1, 2, 3, 4, 5, 6\}) = \frac{6}{36}\). This probability is called marginal because it can be written in the margin of a joint probability table, by summing the joint probabilities along each row (or column).
Probability that two dice total 4 is: \(\Pr(x + y = 4) = \frac{3}{36}\).
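Both of these marginal probabilities can be recovered by enumerating the 36 equiprobable outcomes of two dice (a sketch; the variable names are ours):

```r
# all 36 equiprobable outcomes of two fair dice
x <- rep(1:6, times = 6) # die x
y <- rep(1:6, each = 6)  # die y

# marginal probability that die x shows 1 (summing the joint over y)
p_x1 <- sum(x == 1) / 36
p_x1 # 6/36 = 1/6

# probability that the two dice total 4: (1,3), (2,2), (3,1)
p_total4 <- sum(x + y == 4) / 36
p_total4 # 3/36
```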
Conditional probability
What is the probability that die \(x\) equals some value given that the total is 4? For instance, the probability of die \(x\) being equal to 2: \(\Pr(x = 2 \given x + y = 4) = \frac{1}{3}\).
Conditional probability
This conditional probability can be rewritten: \(\Pr(x = 2 \given x + y = 4) = \frac{\Pr(x = 2, x + y = 4)}{\Pr(x + y = 4)} = \frac{1/36}{3/36} = \frac{1}{3}\). Note that \(\Pr(x \given y)\) is not necessarily equal (and is generally not equal) to \(\Pr(y \given x)\).
For instance: the probability of dying knowing that you have been attacked by a shark is not the same as the probability of having been attacked by a shark knowing that you are dead (confusion of the inverse). In the same way, \(p(\text{data} \given \mathcal{H}_{0}) \neq p(\mathcal{H}_{0} \given \text{data})\).
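Conditioning amounts to restricting the same enumeration to the outcomes compatible with what we know (here, a total of 4):

```r
# all 36 equiprobable outcomes of two fair dice
x <- rep(1:6, times = 6)
y <- rep(1:6, each = 6)

# Pr(x = 2 | x + y = 4): among the totals of 4, how often does die x show 2?
p_cond <- sum(x == 2 & x + y == 4) / sum(x + y == 4)
p_cond # 1/3
```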
Deriving Bayes theorem
From Kolmogorov’s axioms and the previous definitions of joint, marginal, and conditional probabilities, we derive the product rule (for dependent events):

\[\Pr(A_{1}, A_{2}) = \Pr(A_{1} \given A_{2}) \Pr(A_{2}) = \Pr(A_{2} \given A_{1}) \Pr(A_{1})\]

Dividing both sides by \(\Pr(A_{2})\) gives Bayes’ theorem:

\[\Pr(A_{1} \given A_{2}) = \frac{\Pr(A_{2} \given A_{1}) \Pr(A_{1})}{\Pr(A_{2})}\]
Let’s imagine we have a bag containing 4 marbles. These marbles can be either white or blue. We know that there are precisely 4 marbles, but we don’t know the number of marbles of each colour.
However, we do know that there are five possibilities (which we consider to be our hypotheses):
The aim is to determine which combination is the most likely, given certain observations. Let’s imagine that we draw three marbles in a row, with replacement, and we obtain the following sequence: 🔵⚪🔵.
This sequence represents our data. What inference can we make about the contents of the bag? In other words, what can we say about the probability of each hypothesis?
⚪⚪⚪⚪
🔵⚪⚪⚪
🔵🔵⚪⚪
🔵🔵🔵⚪
🔵🔵🔵🔵
Enumerating possibilities
Hypothesis: 🔵⚪⚪⚪
Data: 🔵
Enumerating possibilities
Hypothesis: 🔵⚪⚪⚪
Data: 🔵⚪
Enumerating possibilities
Hypothesis: 🔵⚪⚪⚪
Data: 🔵⚪🔵
Under this hypothesis, \(3\) paths (out of \(4^{3} = 64\)) lead to the result obtained. What about the other hypotheses?
Comparing hypotheses
Given the data, the most probable hypothesis is the one that maximises the number of possible ways of producing the observed data.
Hypothesis | Ways to obtain the data
⚪⚪⚪⚪ | 0 x 4 x 0 = 0
🔵⚪⚪⚪ | 1 x 3 x 1 = 3
🔵🔵⚪⚪ | 2 x 2 x 2 = 8
🔵🔵🔵⚪ | 3 x 1 x 3 = 9
🔵🔵🔵🔵 | 4 x 0 x 4 = 0
Evidence accumulation
In the previous case, we considered that all the hypotheses were equiprobable a priori (according to the indifference principle). However, we could have a priori information from our knowledge (of the characteristics of the bags of marbles, for example) or from previous data.
Let’s assume we draw a new marble from the bag. How do we incorporate this new data?
All we have to do is apply the same strategy as before, updating the previous count by multiplying it by the number of ways each hypothesis can produce the new observation. Yesterday’s posterior is today’s prior (Lindley, 2000).
Hypothesis | Ways to produce 🔵 | Previous count | New count
⚪⚪⚪⚪ | 0 | 0 | 0 x 0 = 0
🔵⚪⚪⚪ | 1 | 3 | 3 x 1 = 3
🔵🔵⚪⚪ | 2 | 8 | 8 x 2 = 16
🔵🔵🔵⚪ | 3 | 9 | 9 x 3 = 27
🔵🔵🔵🔵 | 4 | 0 | 0 x 4 = 0
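In R, this update is a single element-wise multiplication (a sketch reproducing the table above):

```r
ways <- c(0, 1, 2, 3, 4) # ways for each hypothesis to produce a blue marble
prev <- c(0, 3, 8, 9, 0) # counts after the first three draws
prev * ways              # new counts: 0 3 16 27 0
```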
Incorporating prior knowledge
Now let’s suppose that an employee at the marble factory tells us that blue marbles are rare… For every bag containing 3 blue marbles, the factory makes two bags containing only 2, and three bags containing only 1. The employee also tells us that every bag contains at least one blue marble and one white marble…
Hypothesis | Previous count | Factory prior | New count
⚪⚪⚪⚪ | 0 | 0 | 0 x 0 = 0
🔵⚪⚪⚪ | 3 | 3 | 3 x 3 = 9
🔵🔵⚪⚪ | 16 | 2 | 16 x 2 = 32
🔵🔵🔵⚪ | 27 | 1 | 27 x 1 = 27
🔵🔵🔵🔵 | 0 | 0 | 0 x 0 = 0
From enumerations to probabilities
The probability of a hypothesis after observing certain data is proportional to the number of ways in which this hypothesis can produce the observed data, multiplied by its a priori probability.
To convert plausibilities to probabilities, all we have to do is normalise them so that the plausibilities of all possible hypotheses sum to \(1\).
We define \(p\) as the proportion of blue marbles in the bag.
Hypothesis | \(p\) | Ways to produce the data | Probability
⚪⚪⚪⚪ | 0 | 0 | 0
🔵⚪⚪⚪ | 0.25 | 3 | 0.15
🔵🔵⚪⚪ | 0.5 | 8 | 0.40
🔵🔵🔵⚪ | 0.75 | 9 | 0.45
🔵🔵🔵🔵 | 1 | 0 | 0
ways <- c(0, 3, 8, 9, 0)
ways / sum(ways)
[1] 0.00 0.15 0.40 0.45 0.00
Notations
\(\theta\) is a parameter or vector of parameters (e.g., the proportion of blue marbles).
\(\color{orangered}{p(x \given \theta)}\) is the conditional probability of the data \(x\) given the parameter \(\theta\) \(\color{orangered}{[p(x \given \theta = \theta)]}\).
\(\color{orangered}{p(x \given \theta)}\), once the value of \(x\) is known, is seen as the likelihood function of the parameter \(\theta\). Note that this is not a valid probability distribution \(\color{orangered}{[p(x = x \given \theta)]}\).
\(\color{steelblue}{p(\theta)}\) is the a priori probability of \(\theta\).
\(\color{purple}{p(\theta \given x)}\) is the a posteriori probability of \(\theta\) (knowing \(x\)).
\(\color{green}{p(x)}\) is the marginal probability of \(x\) (averaged over \(\theta\)), also called the “marginal likelihood” or “integrated likelihood”.
In this framework, for each problem, we will follow 3 steps:
Build the model (the history of the data): likelihood + prior.
Update with data, compute (or approximate) the posterior.
Evaluate the model (quality of fit, sensitivity to assumptions), summarise the results, and readjust if necessary.
Bayesian inference is really just counting and comparing of possibilities […] in order to make good inference about what actually happened, it helps to consider everything that could have happened (McElreath, 2016).
The first idea is that Bayesian inference is reallocation of credibility across possibilities. The second foundational idea is that the possibilities, over which we allocate credibility, are parameter values in meaningful mathematical models (Kruschke, 2015).
\(\color{steelblue}{p(\theta)}\) represents the a priori probability of \(\theta\): what we know about \(\theta\) before observing the data. In this case (a breast-cancer screening example; Gigerenzer et al., 2007): \(\Pr(\text{Cancer=+}) = 0.008\) and \(\Pr(\text{Cancer=-}) = 0.992\).
\(\color{orangered}{p(x \given \theta)}\) represents the conditional probability of the data \(x\) knowing the parameter \(\theta\), also known as the likelihood function of the parameter \(\theta\).
\(p(x)\) is the marginal probability of \(x\) (averaged over \(\theta\)). This is a constant used to normalise the distribution (the “sum of products” from the previous example).
\(\color{purple}{p(\theta \given x)}\) is the a posteriori probability of \(\theta\) given \(x\), that is, what we know about \(\theta\) after seeing \(x\).
(posterior <- (like$"Mam+" * prior) / marginal)
[1] 0.09394572 0.90605428
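The test characteristics are not restated above, but the printed posterior is consistent with a sensitivity \(\Pr(\text{Mam=+} \given \text{Cancer=+}) = 0.9\) and a false-positive rate \(\Pr(\text{Mam=+} \given \text{Cancer=-}) = 0.07\); a self-contained sketch under that assumption:

```r
# assumed test characteristics (consistent with the printed posterior):
# Pr(Mam=+ | Cancer=+) = 0.9, Pr(Mam=+ | Cancer=-) = 0.07
prior <- c(0.008, 0.992)     # Pr(Cancer=+), Pr(Cancer=-)
like_mam_pos <- c(0.9, 0.07) # Pr(Mam=+ | each hypothesis)

marginal <- sum(like_mam_pos * prior) # Pr(Mam=+)
posterior <- like_mam_pos * prior / marginal
posterior # approximately 0.094 0.906
```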
Bayesian inference as probabilistic knowledge updating
Before the mammogram, the probability of a woman drawn at random having breast cancer was \(\Pr(\text{Cancer=+}) = 0.008\) (prior). After a positive result, this probability became \(\Pr(\text{Cancer=+} \given \text{Mam=+}) = 0.09\) (posterior). These probabilities are expressions of our knowledge. After a positive mammogram, we still think it’s “very unlikely” to have cancer, but this probability has changed considerably compared with “before the test”.
A Bayesianly justifiable analysis is one that treats known values as observed values of random variables, treats unknown values as unobserved random variables, and calculates the conditional distribution of unknowns given knowns and model specifications using Bayes’ theorem (Rubin, 1984).
Beta-Binomial model
Bernoulli distribution
Applies to all situations where the data generation process can only result in two mutually exclusive outcomes (e.g., a coin toss). At each trial, if we assume that \(\Pr(\text{heads}) = \theta\), then \(\Pr(\text{tails}) = 1 - \theta\).
Since Bernoulli, we know how to compute the probability of the result of a coin toss, as long as we know the coin’s bias \(\theta\). Let’s assume that \(Y = 0\) when we get Tails, and that \(Y = 1\) when we get Heads. Then \(Y\) is distributed according to a Bernoulli distribution:

\[p(y \given \theta) = \theta^{y} (1 - \theta)^{1 - y}\]
If we have a series of independent and identically distributed throws \(\{Y_i\}\) (i.e., each throw follows a Bernoulli distribution with probability \(\theta\)), the set of throws can be described by a binomial distribution.
For example, suppose we have the following sequence of five throws: Tails, Tails, Tails, Heads, Heads. We can recode this sequence into \(\{0, 0, 0, 1, 1\}\).
Reminder: The probability of each \(1\) is \(\theta\) and the probability of each \(0\) is \(1 - \theta\).
What is the probability of getting 2 heads out of 5 throws?
Bernoulli process
Knowing that the trials are independent of each other, the probability of obtaining this sequence is \((1 - \theta) \times (1 - \theta) \times (1 - \theta) \times \theta \times \theta\), that is: \(\theta^{2} (1 - \theta)^{3}\).
We can generalise this result for a sequence of \(n\) throws and \(y\) “successes”:
\[\theta^{y} (1 - \theta)^{n - y}\]
So far, we have only considered a single sequence resulting in 2 successes for 5 throws, but there are many sequences that can result in 2 successes for 5 throws (e.g., \(\{0, 0, 1, 0, 1\}\), \(\{0, 1, 1, 0, 0\}\))…
Binomial coefficient
The binomial coefficient gives the number of possible combinations resulting in \(y\) successes for \(n\) throws (read “\(y\) among \(n\)” or “number of combinations of \(y\) among \(n\)”):
\[
\left(\begin{array}{l} n \\ y \end{array}\right) = C_{y}^{n} = \frac{n !}{y !(n - y) !}
\]
For instance for \(y = 1\) and \(n = 3\), we know there are 3 possible combinations: \(\{0, 0, 1\}, \{0, 1, 0\}, \{1, 0, 0\}\). We can check this by applying the formula above.
# computing the total number of possible configurations in R
choose(n = 3, k = 1)
[1] 3
Binomial distribution
\[
p(y \given \theta) = \Pr(Y = y \given \theta) = \left(\begin{array}{l} n \\ y \end{array}\right) \theta^{y}(1 - \theta)^{n - y}
\]

The binomial distribution allows us to calculate the probability of obtaining \(y\) successes in \(n\) trials, for a given \(\theta\). Example of the binomial distribution for an unbiased coin (\(\theta = 0.5\)), indicating the probability of obtaining each number of Heads out of 10 throws (in R: dbinom(x = 0:10, size = 10, prob = 0.5)).
Generating data from a binomial distribution
library(tidyverse)
set.seed(666) # for reproducibility

sample(x = c(0, 1), size = 500, prob = c(0.4, 0.6), replace = TRUE) %>% # theta = 0.6
  data.frame() %>%
  mutate(
    x = seq_along(.),
    y = cummean(.)
  ) %>%
  ggplot(aes(x = x, y = y) ) +
  geom_line(lwd = 1) +
  geom_hline(yintercept = 0.6, lty = 3) +
  labs(x = "Number of trials", y = "Proportion of heads") +
  ylim(0, 1)
Defining the model (likelihood)
Likelihood function:
We consider \(y\) to be the number of successes.
We consider the number of observations \(n\) to be a constant.
We consider \(\theta\) to be the parameter of our model (i.e., the probability of success).
The likelihood function is written as:
\[
\color{orangered}{\mathcal{L}(\theta \given y, n) = p(y \given \theta, n) = \left(\begin{array}{l} n \\ y \end{array}\right) \theta^{y}(1 - \theta)^{n - y} \propto \theta^{y}(1 - \theta)^{n - y}}
\]
Likelihood versus probability
A coin with bias \(\theta\) is tossed (where \(\theta\) represents the probability of getting Heads). This coin is tossed twice and we obtain one Head and one Tail.
The probability of observing exactly one Head in two coin tosses, as a function of different values of \(\theta\), can be calculated as: \(\Pr(y = 1 \given n = 2, \theta) = 2\,\theta (1 - \theta)\).
This probability is defined for a fixed data set and a varying value of \(\theta\). This function can be represented visually.
# graphical representation of the likelihood function for y = 1 and n = 2
y <- 1 # number of heads
n <- 2 # number of trials

data.frame(theta = seq(from = 0, to = 1, length.out = 1e3) ) %>%
  mutate(likelihood = dbinom(x = y, size = n, prob = theta) ) %>%
  ggplot(aes(x = theta, y = likelihood) ) +
  geom_area(color = "orangered", fill = "orangered", alpha = 0.5) +
  labs(x = expression(paste(theta, " - Pr(head)") ), y = "Likelihood")
If we calculate the area under the curve of this function, we get: \(\int_{0}^{1} 2\,\theta (1 - \theta) \, \mathrm{d}\theta = \frac{1}{3} \neq 1\).
When we vary \(\theta\), the likelihood function is not a valid probability distribution (i.e., its integral is not equal to 1). We use the term likelihood to distinguish this type of function from probability density functions. We use the following notation to emphasise the fact that the likelihood function is a function of \(\theta\), and that the data are fixed: \(\mathcal{L}(\theta \given \text{data}) = p(\text{data} \given \theta)\).
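This can be checked numerically: the area under the likelihood curve for \(y = 1\) and \(n = 2\) is \(1/3\), not \(1\):

```r
# integrate the likelihood Bin(1 | 2, theta) = 2 * theta * (1 - theta) over [0, 1]
area <- integrate(function(theta) dbinom(x = 1, size = 2, prob = theta),
                  lower = 0, upper = 1)
area$value # 1/3
```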
Likelihood versus probability for two coin tosses (cells give \(\Pr(y \given \theta, n = 2)\), with \(y\) the number of Heads):

theta | y = 0 | y = 1 | y = 2 | Total
0 | 1.00 | 0.00 | 0.00 | 1
0.2 | 0.64 | 0.32 | 0.04 | 1
0.4 | 0.36 | 0.48 | 0.16 | 1
0.6 | 0.16 | 0.48 | 0.36 | 1
0.8 | 0.04 | 0.32 | 0.64 | 1
1 | 0.00 | 0.00 | 1.00 | 1
Total | 2.20 | 1.60 | 2.20 |
Note that the likelihood of \(\theta\) for particular data is equal to the probability of these data for this value of \(\theta\). However, the distribution of these likelihoods (in columns) is not a probability distribution. In Bayesian analysis, the data are considered fixed and the value of \(\theta\) is considered a random variable.
Defining the model (prior)
How can we define a prior for \(\theta\)?
Semantic aspect \(~\rightarrow~\) the prior should be able to account for:
An absence of information
Knowledge of previous observations concerning this coin
A level of uncertainty concerning these previous observations
Mathematical aspect \(~\rightarrow~\) for a fully analytical solution:
The prior and posterior distributions must have the same form
The marginal likelihood must be calculable analytically
A distribution that satisfies both constraints is the Beta distribution:

\[\color{steelblue}{p(\theta \given a, b) = \mathrm{Beta}(\theta \given a, b) = \frac{\theta^{a - 1} (1 - \theta)^{b - 1}}{B(a, b)}}\]

where \(a\) and \(b\) are two parameters such that \(a > 0\), \(b > 0\), and \(B(a, b)\) is a normalisation constant.
Interpretation of parameters
The absence of prior knowledge can be expressed by setting \(a = b = 1\) (orange distribution).
A prior in favour of an absence of bias can be expressed by setting \(a = b \geq 2\) (green distribution).
A bias in favour of Heads can be expressed by setting \(a > b\) (blue distribution).
A bias in favour of Tails can be expressed by setting \(a < b\) (purple distribution).
The level of certainty increases with the sum \(\kappa = a + b\).
No idea where the coin comes from: \(a = b = 1\) -> flat prior.
While waiting for the experiment to begin, the coin was tossed 10 times and we observed 5 “Heads”: \(a = b = 5\) -> weakly informative prior.
The coin comes from the Bank of France: \(a = b = 50\) -> strongly informative prior.
Suppose we have an estimate of the most likely value of the parameter \(\theta\). We can re-parametrise the Beta distribution as a function of its mode \(\omega\) and the level of certainty \(\kappa\) (for \(\kappa > 2\)):

\[a = \omega (\kappa - 2) + 1 \qquad b = (1 - \omega) (\kappa - 2) + 1\]
If \(\omega = 0.65\) and \(\kappa = 25\), then \(p(\theta) = \mathrm{Beta}(\theta \given 15.95, 9.05)\). If \(\omega = 0.65\) and \(\kappa = 10\) then \(p(\theta) = \mathrm{Beta}(\theta \given 6.2, 3.8)\).
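These correspondences can be checked directly in R:

```r
omega <- 0.65 # mode of the Beta distribution
kappa <- 25   # concentration (level of certainty), kappa > 2

a <- omega * (kappa - 2) + 1
b <- (1 - omega) * (kappa - 2) + 1
c(a, b) # 15.95 9.05
```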
Conjugate prior
Formally, if \(\mathcal{F}\) is a class of sampling distributions \(p(y \given \theta)\), and \(\mathcal{P}\) is a class of prior distributions for \(\theta\), then the class \(\mathcal{P}\) is conjugate to \(\mathcal{F}\) if:
\[
p(\theta \given y) \in \mathcal{P} \text{ for all } p(\cdot \given \theta) \in \mathcal{F} \text{ and } p(\cdot) \in \mathcal{P}
\]
(p.35, Gelman et al., 2013). In other words, a prior is called a conjugate prior if, when converted to a posterior by being multiplied by the likelihood, it keeps the same form. In our case, the Beta prior is a conjugate prior for the Binomial likelihood, because the posterior is a Beta distribution as well.
Analytical derivation of the posterior distribution
Assume a prior defined by: \(\ \color{steelblue}{p(\theta \given a, b) = \mathrm{Beta}(\theta \given a, b) = \frac{\theta^{a - 1}(1 - \theta)^{b - 1}}{B(a, b)} \propto \theta^{a - 1}(1 - \theta)^{b - 1}}\)
Given a likelihood function associated with \(y\) “Heads” for \(n\) throws:
\(\ \color{orangered}{p(y \given n, \theta) = \mathrm{Bin}(y \given n, \theta) = \left(\begin{array}{l} n \\ y \end{array}\right) \theta^{y}(1 - \theta)^{n - y} \propto \theta^{y}(1 - \theta)^{n - y}}\)
Multiplying the prior by the likelihood gives:

\[\color{purple}{p(\theta \given y, n)} \propto \color{orangered}{\theta^{y} (1 - \theta)^{n - y}} \; \color{steelblue}{\theta^{a - 1} (1 - \theta)^{b - 1}} = \theta^{y + a - 1} (1 - \theta)^{n - y + b - 1}\]

Here, we have ignored the constants which do not depend on \(\theta\) (i.e., the number of combinations in the binomial likelihood function and the Beta function \(B(a, b)\) in the Beta prior). Taking them into account, we obtain a Beta posterior distribution of the following form:
\[\color{purple}{p(\theta \given y, n) = \mathrm{Beta}(y + a, n - y + b)}\]
An example to help you digest
We observe \(y = 7\) correct answers out of \(n = 10\) questions. We choose a prior \(\mathrm{Beta}(1, 1)\), that is, a uniform prior on \([0, 1]\). This prior is equivalent to a priori knowledge of 0 successes and 0 failures (i.e., a flat prior).
The posterior distribution is given by:
\[
\begin{align}
\color{purple}{p(\theta \given y, n)} &\propto \color{orangered}{p(y \given n, \theta)} \ \color{steelblue}{p(\theta)} \\
&\propto \color{orangered}{\mathrm{Bin}(7 \given 10, \theta)} \ \color{steelblue}{\mathrm{Beta}(\theta \given 1, 1)} \\
&= \color{purple}{\mathrm{Beta}(y + a, n - y + b)} \\
&= \color{purple}{\mathrm{Beta}(8, 4)}
\end{align}
\]
The mean of the posterior distribution is given by:
\[
\color{purple}{\underbrace{\frac{y + a}{n + a + b}}_{posterior}} = \color{orangered}{\underbrace{\frac{y}{n}}_{data}} \underbrace{\frac{n}{n + a + b}}_{weight} + \color{steelblue}{\underbrace{\frac{a}{a + b}}_{prior}} \underbrace{\frac{a + b}{n + a + b}}_{weight}
\]
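A quick numerical check of this decomposition with the example above (\(y = 7\), \(n = 10\), \(a = b = 1\)):

```r
y <- 7; n <- 10; a <- 1; b <- 1

post_mean <- (y + a) / (n + a + b) # posterior mean: 8/12
compromise <- (y / n) * (n / (n + a + b)) + (a / (a + b)) * ((a + b) / (n + a + b))
c(post_mean, compromise) # both equal 2/3
```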
Influence of prior on posterior distribution
Case where \(n < a + b, (n = 10, a = 4, b = 16)\).
Case where \(n = a + b, (n = 20, a = 4, b = 16)\).
Case where \(n > a + b, (n = 40, a = 4, b = 16)\).
Key points to remember
The posterior distribution is always a compromise between the prior distribution and the likelihood function (Kruschke, 2015).
The more data we have, the less influence the prior has in estimating the posterior distribution (and vice versa). Warning: When the prior assigns a probability of 0 to certain values of \(\theta\), the model is unable to learn (these values are then considered “impossible”).
New issue: \(p(\text{data})\) is obtained by calculating the sum (for discrete variables) or the integral (for continuous variables) of the joint density \(p(\text{data}, \theta)\) over all possible values of \(\theta\). This can become tricky when the model includes several continuous parameters…
Let’s consider a model with two discrete parameters \(\theta_{1}\) and \(\theta_{2}\). The marginal likelihood is obtained as:

\[p(\text{data}) = \sum_{\theta_{1}} \sum_{\theta_{2}} p(\text{data} \given \theta_{1}, \theta_{2}) \, p(\theta_{1}, \theta_{2})\]
Analytical solution \(\longrightarrow\) Using a conjugate prior (e.g., the Beta-Binomial model).
Discretised solution \(\longrightarrow\) Computing the posterior on a finite set of points (grid method).
Approximate solution \(\longrightarrow\) The parameter space is “cleverly” sampled (e.g., MCMC methods, cf. Course 03).
Problem: this solution is very restrictive. Ideally, the model (likelihood + prior) should be defined on the basis of the interpretation of its parameters, and not merely to facilitate the calculations…
Posterior distribution, grid method
Define the grid
Calculate the prior probability for each grid value
Calculate the likelihood for each grid value
Calculate the product of prior x likelihood for each grid value, then normalise
Problem with the number of parameters… Refining the grid increases the calculation time:
3 parameters with a grid of \(10^3\) nodes = \(10^9\) calculation points
10 parameters with a grid of \(10^3\) nodes = \(10^{30}\) calculation points
The best supercomputer (Frontier) performs around \(1.194 \times 10^{18}\) operations per second. If we consider that it would have to perform 4 operations per node of the grid, it would take more time to go through the grid than the estimated age of the universe (approximately \((4.36 \pm 0.012) \times 10^{17}\) seconds)…
Another solution: Sampling the posterior distribution
To sample (cleverly) a posterior distribution, we can use different implementations of MCMC methods (e.g., Metropolis-Hastings, Hamiltonian Monte Carlo) which we will discuss in Course 03. In the meantime, we will work with samples from the posterior distribution i) to get used to results from MCMC methods and ii) because it is simpler to compute summary statistics (e.g., mean or credible intervals) on samples rather than by computing integrals.
Analytical solution:
a <- b <- 1 # parameters of the Beta prior
n <- 9 # number of observations
y <- 6 # number of successes

p_grid <- seq(from = 0, to = 1, length.out = 1000)
posterior <- dbeta(p_grid, y + a, n - y + b) # plot(posterior)
Grid method:
p_grid <- seq(from = 0, to = 1, length.out = 1000)
prior <- rep(1, 1000) # uniform prior
likelihood <- dbinom(x = y, size = n, prob = p_grid)
posterior <- (likelihood * prior) / sum(likelihood * prior) # plot(posterior)
Sampling the posterior distribution to describe it:
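A minimal sketch of this sampling step, reusing the grid posterior from above (uniform prior, \(y = 6\) successes in \(n = 9\) trials):

```r
# grid posterior, as above (uniform prior, y = 6 successes in n = 9 trials)
p_grid <- seq(from = 0, to = 1, length.out = 1000)
prior <- rep(1, 1000)
likelihood <- dbinom(x = 6, size = 9, prob = p_grid)
posterior <- (likelihood * prior) / sum(likelihood * prior)

# draw samples from the grid-approximated posterior
set.seed(666) # for reproducibility
samples <- sample(x = p_grid, size = 1e4, prob = posterior, replace = TRUE)
mean(samples) # close to the analytical posterior mean 7/11 (about 0.64)
```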
Analytical solution:
The posterior distribution is described explicitly
The model is strongly constrained (i.e., we have to pick conjugate priors)
Grid method
The posterior distribution is only computed for a finite set of values
The finer the grid, the better the estimate of the posterior distribution
There is an “accuracy - calculation time” trade-off
Using samples to summarise the posterior distribution
Estimation of the central tendency: from a set of samples from a posterior distribution, we can calculate the mean, mode, and median. For example, for a uniform prior, 10 coin tosses, and 3 Heads:
mode_posterior <- find_mode(samples) # in blue
mean_posterior <- mean(samples) # in orange
median_posterior <- median(samples) # in green
What is the probability that the bias of the coin \(\theta\) is greater than 0.5?
sum(samples > 0.5) / length(samples) # equivalent to mean(samples > 0.5)
[1] 0.112
What is the probability that the bias of the coin \(\theta\) is between 0.2 and 0.4?
sum(samples > 0.2 & samples < 0.4) / length(samples)
[1] 0.5482
Highest density interval (HDI)
Properties of the highest density interval:
The HDI indicates the most likely values for the parameter (given the data and the priors)
The narrower the HDI, the greater the degree of certainty
The width of the HDI decreases as the number of measurements increases
Definition: the values of the parameter \(\theta\) contained in an 89% HDI are such that \(p(\theta \given \text{data}) > W\), where \(W\) satisfies the following condition:

\[\int_{\theta \,:\, p(\theta \given \text{data}) > W} p(\theta \given \text{data}) \, \mathrm{d}\theta = 0.89\]
This procedure can be used to accept or reject a null value. The region of practical equivalence (ROPE) defines an interval of values considered to be “equivalent” to the null value. The figure below summarises the possible decisions resulting from this procedure (Kruschke, 2018).
Region of practical equivalence (ROPE)
The value of the parameter (e.g., \(\theta = 0.5\)) is rejected if the HDI is entirely outside the ROPE. The parameter value (e.g., \(\theta = 0.5\)) is accepted if the HDI is entirely within the ROPE. If the HDI and the ROPE overlap, we cannot conclude…
We tossed a coin 200 times and got 115 Heads. Is the coin biased? We build two models and try to find out which one best accounts for the data.
\[\begin{eqnarray*}
\left\{
\begin{array}{ll}
\mathcal{M}_0: Y \sim \mathrm{Binomial}(n, \theta = 0.5) & \quad \text{The coin is not biased}\\
\mathcal{M}_1: Y \sim \mathrm{Binomial}(n, \theta \neq 0.5) & \quad \text{The coin is biased}
\end{array}\right.
\end{eqnarray*}\]
The Bayes factor is the ratio of the marginal likelihoods (of the two models).
This BF indicates that the (prior) odds ratio increased (or should be updated) by 20% in favour of \(\mathcal{M}_{0}\) after observing the data. The Bayes factor can also be interpreted as follows: The data are approximately 1.2 times more likely under the \(\mathcal{M}_{0}\) model than under the \(\mathcal{M}_{1}\) model.
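The value of roughly 1.2 can be checked numerically; a sketch, assuming a uniform prior on \(\theta\) under \(\mathcal{M}_{1}\):

```r
# marginal likelihood of M0: theta fixed at 0.5
ml0 <- dbinom(x = 115, size = 200, prob = 0.5)

# marginal likelihood of M1: theta unknown, uniform prior on [0, 1]
ml1 <- integrate(function(theta) dbinom(x = 115, size = 200, prob = theta),
                 lower = 0, upper = 1)$value

ml0 / ml1 # Bayes factor BF01, approximately 1.2
```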
Model checking
The two roles of the likelihood function:
It is a function of \(\theta\) for calculating the posterior distribution: \(\mathcal{L}(\theta \given y, n)\).
When \(\theta\) is known/fixed, it is a probability distribution: \(p(y \given \theta, n) \propto \theta^y(1 - \theta)^{(n - y)}\).
This probability distribution can be used to generate data!
For example: generating 10,000 values from a binomial distribution based on 10 coin tosses and a probability of Heads of 0.6:
samples <- rbinom(n = 1e4, size = 10, prob = 0.6)
In a Bayesian model, there are two sources of uncertainty when generating predictions:
Uncertainty related to the sampling process -> We draw data from a Binomial pdf
Uncertainty about the value of \(\theta\) -> Our knowledge of \(\theta\) is described by a (posterior) pdf
For example: generating 10,000 values from a binomial distribution based on 10 coin tosses and a probability of Heads described by the posterior distribution of \(\theta\):
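A sketch of this posterior predictive step, assuming (as in the grid example above) a \(\mathrm{Beta}(7, 4)\) posterior for \(\theta\), obtained with a uniform prior after 6 Heads in 9 tosses:

```r
set.seed(666) # for reproducibility

# uncertainty about theta: draw from its (assumed) Beta(7, 4) posterior
theta_samples <- rbeta(n = 1e4, shape1 = 7, shape2 = 4)

# sampling uncertainty: one binomial draw of 10 tosses per value of theta
predictive <- rbinom(n = 1e4, size = 10, prob = theta_samples)
table(predictive) # predicted numbers of Heads in 10 tosses
```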
An analyst working in a factory that makes famous Swedish bread rolls read a book that raised a thorny question: why does toast always land on the buttered side? Failing to come up with a plausible answer, she set out to verify this assertion. Her first experiment consisted of dropping a slice of buttered bread from the height of a table. The results of this experiment are available in the tartine1 dataset from the imsb package.
Retrieving the data
First task: Retrieve the data.
# importing the data
data <- open_data(tartine1)

# summary of the data
str(data)
'data.frame': 500 obs. of 2 variables:
$ trial: int 1 2 3 4 5 6 7 8 9 10 ...
$ side : int 1 1 0 1 0 0 1 1 1 0 ...
Questions
Since the toast only has two sides, each drop can be modelled as a draw from a binomial distribution with an unknown parameter \(\theta\). What is the posterior distribution of the parameter \(\theta\) given these data and given that the analyst had no prior knowledge (you can also use your own prior)?
Calculate the 95% HDI of the posterior distribution and give a graphical representation of the result (hint: use the function imsb::posterior_plot()).
Can the null hypothesis that \(\theta = 0.5\) be rejected? Answer this question using the HDI+ROPE procedure.
Import the tartine2 data from the imsb package. Update the model using the mode of the posterior distribution calculated previously.
Proposed solution - Question 1
Since the toast only has two sides, each drop can be modelled as a draw from a binomial distribution with an unknown parameter \(\theta\). What is the posterior distribution of the parameter \(\theta\) given these data?
# number of trials
nbTrial <- length(data$trial)

# number of "successes" (i.e., when the toast lands on the butter side)
nbSuccess <- sum(data$side)

# size of the grid
grid_size <- 1e3

# generating the grid
p_grid <- seq(from = 0, to = 1, length.out = grid_size)

# uniform prior
prior <- rep(1, grid_size)

# computing the likelihood
likelihood <- dbinom(x = nbSuccess, size = nbTrial, prob = p_grid)

# computing the posterior
posterior <- likelihood * prior / sum(likelihood * prior)
Proposed solution - Question 2
Calculate the 95% HDI of the posterior distribution and give a graphical representation of the result.
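A base-R sketch of one possible approach (the counts below are hypothetical placeholders for the nbSuccess and nbTrial computed in Question 1; the quantile-based interval is only close to the HDI because this posterior is unimodal and nearly symmetric, and the course suggests imsb::posterior_plot() for the figure):

```r
# grid posterior as in Question 1 (hypothetical counts: 280 butter-side falls in 500 trials)
p_grid <- seq(from = 0, to = 1, length.out = 1e3)
prior <- rep(1, 1e3)
likelihood <- dbinom(x = 280, size = 500, prob = p_grid)
posterior <- likelihood * prior / sum(likelihood * prior)

# sample from the posterior, then take a 95% quantile-based credible interval
set.seed(666)
samples <- sample(x = p_grid, size = 1e4, prob = posterior, replace = TRUE)
ci <- quantile(samples, probs = c(0.025, 0.975))
ci
```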
Carnap, R. (1971). Logical foundations of probability (4th impression). University of Chicago Press.
Gelman, A., Carlin, J. B., Stern, H., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian data analysis (3rd ed.). CRC Press, Taylor & Francis Group.
Gigerenzer, G., Gaissmaier, W., Kurz-Milcke, E., Schwartz, L. M., & Woloshin, S. (2007). Helping doctors and patients make sense of health statistics. Psychological Science in the Public Interest, 8(2), 53–96. https://doi.org/10.1111/j.1539-6053.2008.00033.x
Jaynes, E. T. (1986). Bayesian methods: General background. In J. H. Justice (Ed.), Maximum entropy and Bayesian methods in applied statistics (pp. 1–25). Cambridge University Press.
Keynes, J. M. (1921). A treatise on probability. Macmillan.
Kolmogorov, A. N. (1933). Foundations of the theory of probability. Chelsea Publishing Company.
Kruschke, J. K. (2015). Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan (2nd ed.). Academic Press.
Kruschke, J. K. (2018). Rejecting or accepting parameter values in Bayesian estimation. Advances in Methods and Practices in Psychological Science, 1(2), 270–280. https://doi.org/10.1177/2515245918771304
Lindley, D. V. (2000). The philosophy of statistics. Journal of the Royal Statistical Society: Series D (The Statistician), 49(3), 293–337.
McElreath, R. (2016). Statistical rethinking: A Bayesian course with examples in R and Stan. CRC Press/Taylor & Francis Group.
McElreath, R. (2020). Statistical rethinking: A Bayesian course with examples in R and Stan (2nd ed.). Taylor & Francis, CRC Press.
Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. The Annals of Statistics, 12(4), 1151–1172. https://doi.org/10.1214/aos/1176346785